{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "54c8fe16",
   "metadata": {},
   "source": [
    "# Step by Step Semantic Deduplication on Text Data\n",
    "\n",
    "GPU accelerated implementation of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540). For more information about semantic deduplication in NeMo Curator, refer to the [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) documentation page.\n",
    "\n",
    "The tutorial here shows how to run Semantic Duplication on text data by executing three workflows sequentially.\n",
    "\n",
    "We also use an ID Generator to show how it works when running it separately.\n",
    "\n",
    "1. Create ID generator\n",
    "2. Running embedding generation\n",
    "3. Running K-Means + pairwise (without duplicate identification)\n",
    "4. Run duplicate identification\n",
    "5. Run removal\n",
    "\n",
    "We also allow users to run these steps as a single workflow, which can be seen in the end to end tutorial in the same directory as this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a2c97d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Silence Curator logs via Loguru\n",
    "os.environ[\"LOGURU_LEVEL\"] = \"ERROR\"\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "input_path = os.path.abspath(\"./input\")\n",
    "semantic_out_dir = os.path.abspath(\"./output/step_by_step\")\n",
    "output_path = os.path.join(semantic_out_dir, \"output\")\n",
    "cache_path = os.path.join(semantic_out_dir, \"cache\")\n",
    "\n",
    "input_filetype = \"parquet\"  # this can be either of jsonl or parquet (you'll need to change how input data is generated and embedding generation reader to be jsonl)\n",
    "output_filetype = \"parquet\"  # this can be either of jsonl or parquet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "985f8a13",
   "metadata": {},
   "source": [
    "## Generate Input Data\n",
    "\n",
    "We generate input data if we don't have files in the path above\n",
    " - We load the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (just the train partition) which has 2,119,719 rows\n",
    " - We split into shards such that no shard has more than 10,000 rows\n",
    " - We create a new ID column which is UUID\n",
    " - We write out ~212 files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "91a8ad78",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo_curator.utils.file_utils import get_all_file_paths_under\n",
    "\n",
    "if len(get_all_file_paths_under(input_path)) == 0:\n",
    "    import os\n",
    "    import uuid\n",
    "\n",
    "    import numpy as np\n",
    "    from datasets import load_dataset\n",
    "\n",
    "    input_df = load_dataset(\"roneneldan/TinyStories\", split=\"train\").to_pandas()\n",
    "    num_rows_per_file = 10_000\n",
    "\n",
    "    os.makedirs(input_path, exist_ok=True)\n",
    "\n",
    "    for i, start_idx in enumerate(range(0, len(input_df), num_rows_per_file)):\n",
    "        end_idx = min(len(input_df), start_idx + num_rows_per_file)\n",
    "        subset_df = input_df.iloc[start_idx:end_idx].copy()\n",
    "        subset_df[\"id\"] = [str(uuid.uuid4()) for _ in range(len(subset_df))]\n",
    "        subset_df.to_parquet(os.path.join(input_path, f\"part_{i}.parquet\"), index=False)\n",
    "\n",
    "    print(f\"Created {len(os.listdir(input_path))} files\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f9e2c0f",
   "metadata": {},
   "source": [
    "# Running Semantic Deduplication Workflow by Workflow\n",
    "\n",
    "Here we intentionally break it down into executing three workflows. \n",
    "\n",
    "We also use the ID Generator to show it works when running it separately.\n",
    "\n",
    "1. Create ID Generator.\n",
    "2. Running embedding generation\n",
    "3. Running K-Means + pairwise (without duplicate identification)\n",
    "4. Run duplicate identification\n",
    "5. Run removal\n",
    "\n",
    "Both steps 2 and 5 require the use of the ID Generator created in step 1.\n",
    "\n",
    "You might want to do this separately if\n",
    "1. You have a separate job that generates embeddings.\n",
    "2. You want to use different machine for different parts of the workflow. \n",
    "3. Your cluster limits how long a job can run for. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9235be51",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo_curator.core.client import RayClient\n",
    "\n",
    "# Number of GPUs should be roughly 2x the memory of the embeddings\n",
    "client = RayClient(num_cpus=64, num_gpus=4)\n",
    "client.start()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edbbe35f",
   "metadata": {},
   "source": [
    "## Create ID Generator\n",
    "\n",
    "This creates a Ray Actor in the background. When we read our dataset now, this actor in the background is used to assign monotonically increasing integer IDs to each row.\n",
    "\n",
    "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.id_generator.html#stages.deduplication.id_generator.create_id_generator_actor) for more information about the `create_id_generator_actor` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "57b8e36b",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-16 13:44:04,277\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:44:04,280\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2025-09-16 13:44:04,342\tINFO usage_lib.py:447 -- Usage stats collection is disabled.\n",
      "2025-09-16 13:44:04,343\tINFO scripts.py:913 -- \u001b[37mLocal node IP\u001b[39m: \u001b[1m127.0.1.1\u001b[22m\n",
      "2025-09-16 13:44:07,176\tSUCC scripts.py:949 -- \u001b[32m--------------------\u001b[39m\n",
      "2025-09-16 13:44:07,176\tSUCC scripts.py:950 -- \u001b[32mRay runtime started.\u001b[39m\n",
      "2025-09-16 13:44:07,176\tSUCC scripts.py:951 -- \u001b[32m--------------------\u001b[39m\n",
      "2025-09-16 13:44:07,176\tINFO scripts.py:953 -- \u001b[36mNext steps\u001b[39m\n",
      "2025-09-16 13:44:07,176\tINFO scripts.py:956 -- To add another node to this Ray cluster, run\n",
      "2025-09-16 13:44:07,176\tINFO scripts.py:959 -- \u001b[1m  ray start --address='127.0.1.1:6379'\u001b[22m\n",
      "2025-09-16 13:44:07,176\tINFO scripts.py:968 -- To connect to this Ray cluster:\n",
      "2025-09-16 13:44:07,176\tINFO scripts.py:970 -- \u001b[35mimport\u001b[39m\u001b[26m ray\n",
      "2025-09-16 13:44:07,176\tINFO scripts.py:971 -- ray\u001b[35m.\u001b[39m\u001b[26minit(_node_ip_address\u001b[35m=\u001b[39m\u001b[26m\u001b[33m'127.0.1.1'\u001b[39m\u001b[26m)\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:983 -- To submit a Ray job using the Ray Jobs CLI:\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:984 -- \u001b[1m  RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py\u001b[22m\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:993 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html \n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:997 -- for more information on submitting Ray jobs to the Ray cluster.\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1002 -- To terminate the Ray runtime, run\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1003 -- \u001b[1m  ray stop\u001b[22m\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1006 -- To view the status of the cluster, use\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1007 --   \u001b[1mray status\u001b[22m\u001b[26m\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1011 -- To monitor and debug Ray, view the dashboard at \n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1012 --   \u001b[1m127.0.0.1:8265\u001b[22m\u001b[26m\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1019 -- \u001b[4mIf connection to the dashboard fails, check your firewall settings and network configuration.\u001b[24m\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1123 -- \u001b[36m\u001b[1m--block\u001b[22m\u001b[39m\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1124 -- This command will now block forever until terminated by a signal.\n",
      "2025-09-16 13:44:07,177\tINFO scripts.py:1127 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-16 13:44:07,377\tINFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n"
     ]
    }
   ],
   "source": [
    "from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor\n",
    "\n",
    "create_id_generator_actor()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d24e9ed",
   "metadata": {},
   "source": [
    "## Run Embedding Generation\n",
    "\n",
    "We output the embeddings as Parquet files so that we can read more smartly during our K-Means step. This is the recommended file format before you run K-Means.\n",
    "\n",
    "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.embedders.base.html#stages.text.embedders.base.EmbeddingCreatorStage) for more information about the `EmbeddingCreatorStage` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "271b0a1e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-16 13:44:12,448\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:44:12,451\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:44:12,457\tINFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
      "2025-09-16 13:44:12,471\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:44:12,473\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:44:12,473\tINFO worker.py:1789 -- Calling ray.init() again after it has already been called.\n",
      "Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 9597.95it/s]\n",
      "Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 31215.36it/s]\n"
     ]
    }
   ],
   "source": [
    "from nemo_curator.pipeline import Pipeline\n",
    "from nemo_curator.stages.text.embedders import EmbeddingCreatorStage\n",
    "from nemo_curator.stages.text.io.reader import ParquetReader\n",
    "from nemo_curator.stages.text.io.writer import ParquetWriter\n",
    "\n",
    "embedding_output_path = os.path.join(cache_path, \"embeddings\")\n",
    "\n",
    "embedding_pipeline = Pipeline(\n",
    "    name=\"embedding_pipeline\",\n",
    "    stages=[\n",
    "        # We specify _generate_ids=True to use the ID generator we created in step 1.\n",
    "        ParquetReader(file_paths=input_path, files_per_partition=1, fields=[\"text\"], _generate_ids=True),\n",
    "        EmbeddingCreatorStage(\n",
    "            model_identifier=\"sentence-transformers/all-MiniLM-L6-v2\",\n",
    "            text_field=\"text\",\n",
    "            max_seq_length=None,\n",
    "            max_chars=None,\n",
    "            embedding_pooling=\"mean_pooling\",\n",
    "            model_inference_batch_size=256,\n",
    "        ),\n",
    "        # We specify the fields out so that we also don't end up writing the `text` field, or the intermediate `input_ids` and `attention_mask` fields which are no longer needed.\n",
    "        ParquetWriter(path=embedding_output_path, fields=[\"_curator_dedup_id\", \"embeddings\"]),\n",
    "    ],\n",
    ")\n",
    "\n",
    "embedding_out = embedding_pipeline.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f8044ba",
   "metadata": {},
   "source": [
    "### Save the ID Generator to disk and kill the actor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24bc3d06",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-16 13:48:10,787\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:48:10,791\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:48:10,800\tINFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(KMeansReadFitWriteStage pid=2365182)\u001b[0m 203520000\n",
      "\u001b[36m(KMeansReadFitWriteStage pid=2365188)\u001b[0m 203520000\n",
      "\u001b[36m(KMeansReadFitWriteStage pid=2365179)\u001b[0m 203412096\n",
      "\u001b[36m(KMeansReadFitWriteStage pid=2365184)\u001b[0m 203520000\n"
     ]
    }
   ],
   "source": [
    "from nemo_curator.stages.deduplication.id_generator import kill_id_generator_actor, write_id_generator_to_disk\n",
    "\n",
    "id_generator_actor_path = os.path.join(output_path, \"semantic_id_generator.json\")\n",
    "write_id_generator_to_disk(id_generator_actor_path)\n",
    "kill_id_generator_actor()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a600100",
   "metadata": {},
   "source": [
    "#### Embeddings Results\n",
    "\n",
    "1. `_curator_dedup_id` : The ID field generated by our `IdGenerator` and `ParquetReader`, since in our `ParquetReader` we specified `_generate_ids=True`.\n",
    "2. `embeddings` : The embedding generated by the model we used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "736cd36a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_curator_dedup_id</th>\n",
       "      <th>embeddings</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1949719</td>\n",
       "      <td>[-0.12394736707210541, 0.010744917206466198, 0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1949720</td>\n",
       "      <td>[-0.07273813337087631, 0.06685175746679306, 0....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1949721</td>\n",
       "      <td>[-0.04823765903711319, 0.11327654868364334, -0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1949722</td>\n",
       "      <td>[-0.08059918135404587, 0.024182168766856194, -...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1949723</td>\n",
       "      <td>[-0.031761009246110916, 0.02613956294953823, 0...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _curator_dedup_id                                         embeddings\n",
       "0            1949719  [-0.12394736707210541, 0.010744917206466198, 0...\n",
       "1            1949720  [-0.07273813337087631, 0.06685175746679306, 0....\n",
       "2            1949721  [-0.04823765903711319, 0.11327654868364334, -0...\n",
       "3            1949722  [-0.08059918135404587, 0.024182168766856194, -...\n",
       "4            1949723  [-0.031761009246110916, 0.02613956294953823, 0..."
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings_path = os.path.join(cache_path, \"embeddings\")\n",
    "\n",
    "pd.read_parquet(os.path.join(embeddings_path, os.listdir(embeddings_path)[0])).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bec854a",
   "metadata": {},
   "source": [
    "## Run Semantic Deduplication workflow (without specifying `eps`)\n",
    "\n",
    "We intentionally don't specify `eps` so that we can show how to run `IdentifyDuplicates` as a separate stage.\n",
    "\n",
    "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.semantic.workflow.html#stages.deduplication.semantic.workflow.SemanticDeduplicationWorkflow) for more information about the `SemanticDeduplicationWorkflow` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e069a223",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-16 13:48:14,425\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:48:14,426\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:48:14,427\tINFO worker.py:1789 -- Calling ray.init() again after it has already been called.\n",
      "2025-09-16 13:48:40,029\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:48:40,031\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:48:40,039\tINFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
      "2025-09-16 13:48:40,054\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:48:40,056\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:48:40,056\tINFO worker.py:1789 -- Calling ray.init() again after it has already been called.\n"
     ]
    }
   ],
   "source": [
    "from nemo_curator.stages.deduplication.semantic import RankingStrategy, SemanticDeduplicationWorkflow\n",
    "\n",
    "semantic_workflow_path = os.path.join(cache_path, \"semantic_dedup\")\n",
    "\n",
    "workflow = SemanticDeduplicationWorkflow(\n",
    "    input_path=embedding_output_path,\n",
    "    output_path=semantic_workflow_path,\n",
    "    n_clusters=100,\n",
    "    # Since we use Id Generator in the embedding generation step, we need to specify the ID field as `_curator_dedup_id`\n",
    "    id_field=\"_curator_dedup_id\",\n",
    "    embedding_field=\"embeddings\",\n",
    "    ranking_strategy=RankingStrategy(metadata_cols=[\"cosine_dist_to_cent\"], ascending=True),\n",
    "    # if eps is specified then it'll also run IdentifyDuplicates stage\n",
    "    eps=None,\n",
    ")\n",
    "semantic_out = workflow.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8743b9f",
   "metadata": {},
   "source": [
    "#### K-Means Results\n",
    "\n",
    "1. `_curator_dedup_id` : The IDs of the rows that belong to the cluster.\n",
    "2. `embeddings` : These are later used for pairwise similarity.\n",
    "3. `l2_dist_to_cent` / `cosine_dist_to_cent` : This represents how far (l2 distance or cosine distance) a sample is from our cluster's centroid.\n",
    "    - These fields help us define how we want to prioritize ranking within our cluster. See `RankingStrategy`\n",
    "    - If we had other `metadata_fields` provided they would be used here instead.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f7527710",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_curator_dedup_id</th>\n",
       "      <th>embeddings</th>\n",
       "      <th>l2_dist_to_cent</th>\n",
       "      <th>cosine_dist_to_cent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>280260</td>\n",
       "      <td>[-0.051341612, 0.023956891, 0.0818636, 0.01455...</td>\n",
       "      <td>0.564816</td>\n",
       "      <td>0.174425</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>280366</td>\n",
       "      <td>[-0.029556809, 0.040268984, 0.13631389, -0.001...</td>\n",
       "      <td>0.677134</td>\n",
       "      <td>0.261470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>280795</td>\n",
       "      <td>[-0.08078232, 0.032830615, 0.10732673, -0.0177...</td>\n",
       "      <td>0.545988</td>\n",
       "      <td>0.161375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>280814</td>\n",
       "      <td>[-0.016866645, 0.03228915, 0.021343842, 0.0527...</td>\n",
       "      <td>0.599459</td>\n",
       "      <td>0.199594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>281333</td>\n",
       "      <td>[-0.04058464, 0.023736855, 0.09525293, 0.05758...</td>\n",
       "      <td>0.553146</td>\n",
       "      <td>0.166285</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _curator_dedup_id                                         embeddings  \\\n",
       "0             280260  [-0.051341612, 0.023956891, 0.0818636, 0.01455...   \n",
       "1             280366  [-0.029556809, 0.040268984, 0.13631389, -0.001...   \n",
       "2             280795  [-0.08078232, 0.032830615, 0.10732673, -0.0177...   \n",
       "3             280814  [-0.016866645, 0.03228915, 0.021343842, 0.0527...   \n",
       "4             281333  [-0.04058464, 0.023736855, 0.09525293, 0.05758...   \n",
       "\n",
       "   l2_dist_to_cent  cosine_dist_to_cent  \n",
       "0         0.564816             0.174425  \n",
       "1         0.677134             0.261470  \n",
       "2         0.545988             0.161375  \n",
       "3         0.599459             0.199594  \n",
       "4         0.553146             0.166285  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kmeans_path_first_centroid = os.path.join(semantic_workflow_path, \"kmeans_results\", \"centroid=0\")\n",
    "\n",
    "pd.read_parquet(os.path.join(kmeans_path_first_centroid, os.listdir(kmeans_path_first_centroid)[0])).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a8e14ef",
   "metadata": {},
   "source": [
    "#### Pairwise Similarity Result\n",
    "\n",
    "1. `id` : The identifier for the duplicate row.\n",
    "2. `max_id` : The closest pair for the duplicate row.\n",
    "3. `cosine_sim_score` : The cosine similarity between the two points.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "db2cf58f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>max_id</th>\n",
       "      <th>cosine_sim_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1806089</td>\n",
       "      <td>1806089</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1339989</td>\n",
       "      <td>1806089</td>\n",
       "      <td>0.932138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>699085</td>\n",
       "      <td>1806089</td>\n",
       "      <td>0.936246</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1220322</td>\n",
       "      <td>1806089</td>\n",
       "      <td>0.915550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1948996</td>\n",
       "      <td>1806089</td>\n",
       "      <td>0.928383</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id   max_id  cosine_sim_score\n",
       "0  1806089  1806089          0.000000\n",
       "1  1339989  1806089          0.932138\n",
       "2   699085  1806089          0.936246\n",
       "3  1220322  1806089          0.915550\n",
       "4  1948996  1806089          0.928383"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pairwise_path = os.path.join(cache_path, \"semantic_dedup\", \"pairwise_results\")\n",
    "\n",
    "pd.read_parquet(os.path.join(pairwise_path, \"cluster_0.parquet\")).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1dec1875",
   "metadata": {},
   "source": [
    "#### Investigate Results of Semantic Workflow\n",
    "\n",
    "Depending on our dataset size we can read through all of the files and plot how much data is similar to one another.\n",
    "Here we show how to read file by file and then perform a reduce. \n",
    "\n",
    "Based on the analysis here we can decide what our `eps` parameter should be, proceed to identify the duplicates, and finally perform removal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "32d68935",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "from functools import reduce\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "pairwise_path = os.path.join(semantic_workflow_path, \"pairwise_results\")\n",
    "\n",
    "\n",
    "def get_bins(df: pd.DataFrame, num_bins: int = 1_000) -> dict[float, int]:\n",
    "    bins = np.linspace(0, 1.01, num_bins)\n",
    "\n",
    "    return Counter(\n",
    "        pd.cut(df[\"cosine_sim_score\"], bins=bins, labels=bins[1:], retbins=False, include_lowest=True, right=True)\n",
    "        .value_counts()\n",
    "        .to_dict()\n",
    "    )\n",
    "\n",
    "\n",
    "similarity_across_dataset = reduce(\n",
    "    lambda x, y: x + y,\n",
    "    [\n",
    "        get_bins(pd.read_parquet(os.path.join(pairwise_path, f), columns=[\"cosine_sim_score\"]), num_bins=1000)\n",
    "        for f in os.listdir(pairwise_path)\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cc077ee",
   "metadata": {},
   "source": [
    "Looking at the graph below we see 20% of our dataset is above 0.9 cosine similarity. So for the purpose of this tutorial we can use 0.1 (1 - 0.9) as our `eps` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "fdf55383",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAG2CAYAAACDLKdOAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjMsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvZiW1igAAAAlwSFlzAAAPYQAAD2EBqD+naQAAWkFJREFUeJzt3XtYlNXePvB7QGYQBUGRg0SCp9RUINjyohmWKKXbQ7XTopQw3TuVrTkZiQcQNTFTxNSiPGeZdjDbpZskDMpEMJS2eSpPYSp4SlDQYWDW7w9+TI4gzjPOmftzXVwvs+aZNTcj7+bbWt/neWRCCAEiIiIiO+Fg6QBERERExsTihoiIiOwKixsiIiKyKyxuiIiIyK6wuCEiIiK7wuKGiIiI7AqLGyIiIrIrLG6IiIjIrrC4ISIiIrvC4oaIiIjsikWLm++//x5Dhw5Fu3btIJPJsG3btru+JicnBw899BAUCgU6deqE9evXmzwnERER2Q6LFjcVFRUICgrCypUr9Tr+1KlTGDJkCB599FEUFRXhlVdewbhx4/DNN9+YOCkRERHZCpm13DhTJpPhiy++wIgRI+54zOuvv47t27fjl19+0Y49++yzuHr1KjIzM82QkoiIiKxdM0sHkCIvLw9RUVE6Y9HR0XjllVfu+BqVSgWVSqV9rNFocOXKFbRp0wYymcxUUYmIiMiIhBC4du0a2rVrBweHxjeebKq4KSkpgbe3t86Yt7c3ysvLcePGDTRv3rzea1JTU5GSkmKuiERERGRCZ86cwX333dfoMTZV3BgiMTERSqVS+7isrAz3338/Tp06BVdXV6O9j1qtxnfffYdHH30UTk5ORpvXlGwxM2CbuZnZPJjZPJjZPGwxc3nFTUQu3QMA+DHhEbjIjVdmXLt2DYGBgXr97bap4sbHxwelpaU6Y6WlpXBzc2tw1QYAFAoFFApFvfHWrVvDzc3NaNnUajVcXFzQpk0bm/kltMXMgG3mZmbzYGbzYGbzsMXMzZxvwEHhAgBo06aNUYubus9An5YSm7rOTUREBLKzs3XGsrKyEBERYaFEREREZG0sWtxcv34dRUVFKCoqAlB7qndRURGKi4sB1G4pjRkzRnv8yy+/jJMnTyIhIQFHjx7FO++8g08++QRTp061RHwiIiKyQhYtbn766SeEhIQgJCQEAKBUKhESEoKkpCQAwPnz57WFDgAEBgZi+/btyMrKQlBQEJYsWYLVq1cjOjraIvmJiIjI+li056Z///5o7DI7DV19uH///jhw4IAJUxEREZEts6meGyIiIqK7YXFDREREdoXFDREREdkVFjdERERkV1jcEBERkV1hcUNERER2hcUNERER2RUWN0RERGRXWNwQERGRXWFxQ0RERHaFxQ0RERHZFRY3REREZFdY3BAREZFdYXFDREREdoXFDREREdkVFjdERERkV1jcEBERkV1hcUNERER2hcUNERER2RUWN0RERGRXWNwQERGRXWFxQ0RERHaFxQ0RERHZFRY3REREZFdY3BAREZFdYXFDREREdsXixc3KlSsREBAAZ2dnhIeHo6Cg4I7HqtVqzJ07Fx07doSzszOCgoKQmZlpxrRERERk7Sxa3GzZsgVKpRLJycnYv38/goKCEB0djQsXLjR4/KxZs/Dee+9h+fLlOHz4MF5++WU8+eSTOHDggJmTExERkbWyaHGTlpaG8ePHIy4uDt27d0dGRgZcXFywdu3aBo/fuHEjZsyYgcGDB6NDhw6YMGECBg8ejCVLlpg5OREREVmrZpZ646qqKhQWFiIxMVE75uDggKioKOTl5TX4GpVKBWdnZ52x5s2bY/fu3Xd8H5VKBZVKpX1cXl4OoHaLS61W38uPoKNuLmPOaWq2mBmwzdzMbB7MbB7MbB62mbn6lu/VUMuEEefW/3OQCSGM984SnDt3Dn5+ftizZw8iIiK04wkJCcjNzUV+fn6918TExODnn3/Gtm3b
0LFjR2RnZ2P48OGoqanRKWBuNWfOHKSkpNQb37RpE1xcXIz3AxERETVxqhogoaB23WRR72ooHI03d2VlJWJiYlBWVgY3N7dGj7XYyo0hli1bhvHjx6Nr166QyWTo2LEj4uLi7riNBQCJiYlQKpXax+Xl5fD398egQYPu+uFIoVarkZWVhYEDB8LJyclo85qSLWYGbDM3M5sHM5sHM5uHLWYuq7gJFHwPAIiOHgQXufHKjLqdF33c07vevHmz3jaRvjw9PeHo6IjS0lKd8dLSUvj4+DT4mrZt22Lbtm24efMmLl++jHbt2mH69Ono0KHDHd9HoVBAoVDUG3dycjLJL4up5jUlW8wM2GZuZjYPZjYPZjYPW8rs5FR9y/dOcHIyXnEj5TOQ3FCs0Wgwb948+Pn5oWXLljh58iQAYPbs2VizZo3e88jlcoSGhiI7O1tn7uzsbJ1tqoY4OzvDz88P1dXV+PzzzzF8+HCpPwYRERHZKcnFzfz587F+/XosWrQIcrlcO96jRw+sXr1a0lxKpRKrVq3Chg0bcOTIEUyYMAEVFRWIi4sDAIwZM0an4Tg/Px9bt27FyZMn8cMPP+Dxxx+HRqNBQkKC1B+DiIiI7JTk9aIPPvgA77//PgYMGICXX35ZOx4UFISjR49KmmvUqFG4ePEikpKSUFJSguDgYGRmZsLb2xsAUFxcDAeHv+qvmzdvYtasWTh58iRatmyJwYMHY+PGjXB3d5f6YxAREZGdklzcnD17Fp06dao3rtFoDDpdLT4+HvHx8Q0+l5OTo/M4MjIShw8flvweRERE1HRI3pbq3r07fvjhh3rjn332GUJCQowSioiIiMhQkldukpKSEBsbi7Nnz0Kj0WDr1q04duwYPvjgA3z99demyEhERESkN8krN8OHD8dXX32Fb7/9Fi1atEBSUhKOHDmCr776CgMHDjRFRiIiIiK9SVq5qa6uxoIFCzB27FhkZWWZKhMRERGRwSSt3DRr1gyLFi1CdXX13Q8mIiIisgDJ21IDBgxAbm6uKbIQERER3TPJDcVPPPEEpk+fjoMHDyI0NBQtWrTQeX7YsGFGC0dEREQkleTiZuLEiQCAtLS0es/JZDLU1NTceyoiIiIiA0kubjQajSlyEBERERmF5J4bIiIiImtm0L3Ic3NzsXjxYhw5cgRA7VWLX3vtNfTr18+o4YiIiMg4hBC4oTZt68iNKutoTZFc3Hz44YeIi4vDU089hcmTJwMAfvzxRwwYMADr169HTEyM0UMSERGRNLcWM0IAz2Tk4fD5cgunMg/Jxc0bb7yBRYsWYerUqdqxyZMnIy0tDfPmzWNxQ0REZCF1BY2li5nQ+93R3MnRIu8NGFDcnDx5EkOHDq03PmzYMMyYMcMooYiIiEgajUbg78t3N1rQdPd1w6cvR0AmM00GtVqNb77ZiRF//xtkpnoTPUgubvz9/ZGdnY1OnTrpjH/77bfw9/c3WjAiIiLSj0YjMCAtF6cuVeiM317MNHdyNGnRoZYJKBxh0cIGMKC4efXVVzF58mQUFRWhT58+AGp7btavX49ly5YZPSARERE1TAiByqoa/H35bm1hE+jZAl//+2HIZKYvZqyV5OJmwoQJ8PHxwZIlS/DJJ58AALp164YtW7Zg+PDhRg9IRERE9TW0DRXo2QLZykg4ODS9guZWBp0K/uSTT+LJJ580dhYiIiLSQ0PbUN193fD1vx9u8oUNYEBxs2/fPmg0GoSHh+uM5+fnw9HREWFhYUYLR0RERLqEEA1uQ7nIm+YWVEMkX6F40qRJOHPmTL3xs2fPYtKkSUYJRURERA2rrKrRbkXVbUO1UDRjYXMLycXN4cOH8dBDD9UbDwkJweHDh40SioiIiOqr67Opw22ohkkubhQKBUpLS+uNnz9/Hs2aGdTCQ0RERHdx+3ZUd183uMgtd6E8aya5uBk0aBASExNRVlamHbt69SpmzJiBgQMHGjUcERER1bp9O6r2dG+u2jRE8lLL
4sWL8cgjj6B9+/YICQkBABQVFcHb2xsbN240ekAiIqKmTgiBZzLytI+5HdU4ycWNn58f/ve//+Gjjz7Czz//jObNmyMuLg7PPfccnJycTJGRiIioSbt11YbbUXdnUJNMixYt8M9//tPYWYiIiOg2tU3EP2of195Ogas2jZHcc7NhwwZs375d+zghIQHu7u7o06cPfv/9d8kBVq5ciYCAADg7OyM8PBwFBQWNHp+eno4HHngAzZs3h7+/P6ZOnYqbN29Kfl8iIiJrJwQw4t29bCKWSHJxs2DBAjRv3hwAkJeXhxUrVmDRokXw9PTE1KlTJc21ZcsWKJVKJCcnY//+/QgKCkJ0dDQuXLjQ4PGbNm3C9OnTkZycjCNHjmDNmjXYsmUL70ZORER2qUoDHCm5BoBNxFJILm7OnDmjvSP4tm3b8I9//AP//Oc/kZqaih9++EHSXGlpaRg/fjzi4uLQvXt3ZGRkwMXFBWvXrm3w+D179qBv376IiYlBQEAABg0ahOeee+6uqz1ERES2jk3E+pPcc9OyZUtcvnwZ999/P3bu3AmlUgkAcHZ2xo0bN/Sep6qqCoWFhUhMTNSOOTg4ICoqCnl5eQ2+pk+fPvjwww9RUFCA3r174+TJk9ixYwdGjx59x/dRqVRQqVTax+XltQ1ZarUaarVa77x3UzeXMec0NVvMDNhmbmY2D2Y2D2Y2j6qqKiz75a8tqOpqNdQOwoKJ7s6Un7OUOWVCCEmf1PPPP4+jR48iJCQEH3/8MYqLi9GmTRv85z//wYwZM/DLL7/oNc+5c+fg5+eHPXv2ICIiQjuekJCA3Nxc5OfnN/i6t99+G9OmTYMQAtXV1Xj55Zfx7rvv3vF95syZg5SUlHrjmzZtgouLi15ZiYiIzE1VAyQU1K5B+LkIvNarBk15R6qyshIxMTEoKyuDm5tbo8dKXrlZuXIlZs2ahTNnzuDzzz9HmzZtAACFhYV47rnnDEusp5ycHCxYsADvvPMOwsPDcfz4cUyZMgXz5s3D7NmzG3xNYmKidnUJqF258ff3x6BBg+764UihVquRlZWFgQMH2swp8baYGbDN3MxsHsxsHsxsekIIDHsnD8B1AMB25QC0UFj/XQBM+TnX7bzoQ/In5e7ujhUrVtQbb2h1pDGenp5wdHSsdyuH0tJS+Pj4NPia2bNnY/To0Rg3bhwAoGfPnqioqMA///lPzJw5Ew4O9VuIFAoFFApFvXEnJyeT/IKbal5TssXMgG3mZmbzYGbzYGbTqVBV42hJbWHTzccVrVo421QjsSk+ZynzSW4oNha5XI7Q0FBkZ2drxzQaDbKzs3W2qW5VWVlZr4BxdKzdj5S4u0ZERGSVbr8a8cfj/mZThY01sOgal1KpRGxsLMLCwtC7d2+kp6ejoqICcXFxAIAxY8bAz88PqampAIChQ4ciLS0NISEh2m2p2bNnY+jQodoih4iIyJbdUP91NWI/F8Hr2hjAosXNqFGjcPHiRSQlJaGkpATBwcHIzMyEt7c3AKC4uFhnpWbWrFmQyWSYNWsWzp49i7Zt22Lo0KF44403LPUjEBERGdWtGxFTetRw1cYAFu9Oio+PR3x8fIPP5eTk6Dxu1qwZkpOTkZycbIZkRERE5nX7lhQZRnLPTXJyskG3WSAiIqLG3XqDzG4+rpBbrDPWtkn+2L788kt07NgRAwYMwKZNm3QukEdERESGabiR2IKBbJjk4qaoqAj79u3Dgw8+iClTpsDHxwcTJkzAvn37TJGPiIioSbi1kZg3yLw3Bi14hYSE4O2338a5c+ewZs0a/PHHH+jbty969eqFZcuWoayszNg5iYiI7NqtjcSfvhzBRuJ7cE+7eUIIqNVqVFVVQQgBDw8PrFixAv7+/tiyZYuxMhIREdm127ekWNfcG4OKm8LCQsTHx8PX1xdTp05FSEgIjhw5gtzcXPz222944403MHnyZGNnJSIisku3b0k1d+KW1L2QXNz07NkT//d//4dTp05h
zZo1OHPmDBYuXIhOnTppj3nuuedw8eJFowYlIiJqCrglde8kX+dm5MiRGDt2LPz8/O54jKenJzQazT0FIyIiaipu7bdhXXPvJK/c1PXW3O7GjRuYO3euUUIRERE1Fbxwn/FJLm5SUlJw/fr1euOVlZWS7wxORETU1LHfxvgMWrlpaC/w559/RuvWrY0SioiIqKngKeDGp3fPjYeHB2QyGWQyGbp06aLz4dfU1OD69et4+eWXTRKSiIjIHvEUcNPQu7hJT0+HEAJjx45FSkoKWrVqpX1OLpcjICAAERERJglJRERkj7glZRp6FzexsbEAgMDAQPTp0wdOTk4mC0VERNTUcEvKePQqbsrLy+Hm5gag9tYLN27cwI0bNxo8tu44IiIiahxPATcNvYobDw8PnD9/Hl5eXnB3d2+wsqxrNK6pqTF6SCIiInvDU8BNR6/iZteuXdozob777juTBiIiImoK2G9jOnoVN5GRkQCA6upq5ObmYuzYsbjvvvtMGoyIiMie8RRw05F0nZtmzZrhrbfeQnV1tanyEBER2T2eAm5aki/i99hjjyE3N9cUWYiIiJoEbkmZluQbZz7xxBOYPn06Dh48iNDQULRo0ULn+WHDhhktHBERkT3ilpRpSS5uJk6cCABIS0ur9xzPliIiImoct6RMT3Jxo9FoTJGDiIioSeCWlOlJ7rkhIiIi4+CWlGlIXrkBgIqKCuTm5qK4uBhVVVU6z02ePNkowYiIiOwd6xrTkFzcHDhwAIMHD0ZlZSUqKirQunVrXLp0CS4uLvDy8mJxQ0RERBYleVtq6tSpGDp0KP788080b94ce/fuxe+//47Q0FAsXrzYoBArV65EQEAAnJ2dER4ejoKCgjse279/f8hksnpfQ4YMMei9iYiIzOnWM6XINCQXN0VFRXj11Vfh4OAAR0dHqFQq+Pv7Y9GiRZgxY4bkAFu2bIFSqURycjL279+PoKAgREdH48KFCw0ev3XrVpw/f1779csvv8DR0RHPPPOM5PcmIiIyJ95PyjwkFzdOTk5wcKh9mZeXF4qLiwEArVq1wpkzZyQHSEtLw/jx4xEXF4fu3bsjIyMDLi4uWLt2bYPHt27dGj4+PtqvrKwsuLi4sLghIiKrxzOlzENyz01ISAj27duHzp07IzIyEklJSbh06RI2btyIHj16SJqrqqoKhYWFSExM1I45ODggKioKeXn6VbZr1qzBs88+W+9ignVUKhVUKpX2cXl57S+VWq2GWq2WlLcxdXMZc05Ts8XMgG3mZmbzYGbzYOZ7yfHX7Ys2vRTW6O2MrCWzFKbMLGVOmRDSdv9++uknXLt2DY8++iguXLiAMWPGYM+ePejcuTPWrl2LoKAgvec6d+4c/Pz8sGfPHkRERGjHExISkJubi/z8/EZfX1BQgPDwcOTn56N3794NHjNnzhykpKTUG9+0aRNcXFz0zkpERHSvVDVAQkHtusKi3tVQcOFGb5WVlYiJiUFZWRnc3NwaPVbyyk1YWJj2ey8vL2RmZkpPaCRr1qxBz54971jYAEBiYiKUSqX2cXl5Ofz9/TFo0KC7fjhSqNVqZGVlYeDAgXBycjLavKZki5kB28zNzObBzObBzIYRQmD4O3sBXAMAREcPgov8zn+GrSGzVKbMXLfzog+DrnNjLJ6ennB0dERpaanOeGlpKXx8fBp9bUVFBTZv3oy5c+c2epxCoYBCoag37uTkZJJfFlPNa0q2mBmwzdzMbB7MbB7MLE1lVTWOlNQWNt193eDm4qzXBfz4Of81p770Km5CQkL0voLi/v379X5zuVyO0NBQZGdnY8SIEQBqb++QnZ2N+Pj4Rl/76aefQqVS4YUXXtD7/YiIiKwBr0xsWnoVN3WFhykolUrExsYiLCwMvXv3Rnp6OioqKhAXFwcAGDNmDPz8/JCamqrzujVr1mDEiBFo06aNybIREREZy60drqxrTEuv4iY5OdlkAUaNGoWLFy8iKSkJJSUlCA4ORmZm
Jry9vQEAxcXF2lPP6xw7dgy7d+/Gzp07TZaLiIjIWHh9G/OyaM9Nnfj4+DtuQ+Xk5NQbe+CBByDxJC8iIiKL4fVtzEuv4qZ169b49ddf4enpCQ8Pj0b3Ca9cuWK0cERERPaG/Tamp1dxs3TpUri6ugIA0tPTTZmHiIjI7rDfxrz0Km5iY2Mb/J6IiIgax34b8zO45+bChQu4cOECNBqNznivXr3uORQREZG9YL+N+UkubgoLCxEbG4sjR47Ua+qVyWSoqakxWjgiIiJ7wn4b85Bc3IwdOxZdunTBmjVr4O3tzX8kIiKiRrDfxvwkFzcnT57E559/jk6dOpkiDxERkd1gv41lONz9EF0DBgzAzz//bIosREREdoX9NpYheeVm9erViI2NxS+//IIePXrUu5HVsGHDjBaOiIjIXrDfxnwkFzd5eXn48ccf8d///rfec2woJiIiahjrGvORvC3173//Gy+88ALOnz8PjUaj88XChoiIiCxNcnFz+fJlTJ06VXtjSyIiIiJrIrm4eeqpp/Ddd9+ZIgsREZFd4T2eLUNyz02XLl2QmJiI3bt3o2fPnvUaiidPnmy0cERERLaKp4FbjkFnS7Vs2RK5ubnIzc3VeU4mk7G4ISIiAk8DtyTJxc2pU6dMkYOIiMhu8TRw85Lcc0NERER3x9suWI5eKzdKpRLz5s1DixYtoFQqGz02LS3NKMGIiIhsFfttLEuv4ubAgQNQq9Xa7++ES25ERETst7E0vYqbW0/95mngRERE+mO/jfndc89NeXk5tm3bhqNHjxojDxERkV1hXWN+koubkSNHYsWKFQCAGzduICwsDCNHjkTPnj3x+eefGz0gERGRreHF+yxLcnHz/fffo1+/fgCAL774AkIIXL16FW+//Tbmz59v9IBERES2hM3Elie5uCkrK0Pr1q0BAJmZmXj66afh4uKCIUOG4LfffjN6QCIiIlvCZmLLk1zc+Pv7Iy8vDxUVFcjMzMSgQYMAAH/++SecnZ2NHpCIiMhWsZnYMiRfofiVV17B888/j5YtW6J9+/bo378/gNrtqp49exo7HxERkc1iXWMZklduJk6ciL1792Lt2rXYvXs3HBxqp+jQoYNBPTcrV65EQEAAnJ2dER4ejoKCgkaPv3r1KiZNmgRfX18oFAp06dIFO3bskPy+REREpsBmYsuTvHIDAKGhoQgNDdUZGzJkiOR5tmzZAqVSiYyMDISHhyM9PR3R0dE4duwYvLy86h1fVVWFgQMHwsvLC5999hn8/Pzw+++/w93d3ZAfg4iIyKjYTGwdDCpujCUtLQ3jx49HXFwcACAjIwPbt2/H2rVrMX369HrHr127FleuXMGePXvg5OQEAAgICDBnZCIiojtiM7F1sFhxU1VVhcLCQiQmJmrHHBwcEBUVhby8hqve//znP4iIiMCkSZPw5Zdfom3btoiJicHrr78OR8eGf4FUKhVUKpX2cXl57S+dWq3W3lLCGOrmMuacpmaLmQHbzM3M5sHM5sHMjb1Ptfb7TS+Fobq6upGj7zYXP+eG5taHTAjL7A6eO3cOfn5+2LNnDyIiIrTjCQkJyM3NRX5+fr3XdO3aFadPn8bzzz+PiRMn4vjx45g4cSImT56M5OTkBt9nzpw5SElJqTe+adMmuLi4GO8HIiKiJk9VAyQU1K4bLOpdDQUXboymsrISMTExKCsrg5ubW6PHWnRbSiqNRgMvLy+8//77cHR0RGhoKM6ePYu33nrrjsVNYmKizp3My8vL4e/vj0GDBt31w5FCrVYjKysLAwcO1G6ZWTtbzAzYZm5mNg9mNg9mvrMKVTUSCnYBAKKjB8FFbvifWX7Ouup2XvRh0Kf+ww8/4L333sOJEye0jb0bN25EYGAgHn74Yb3m8PT0hKOjI0pLS3XGS0tL4ePj0+BrfH194eTkpLMF1a1bN5SUlKCqqgpyubzeaxQKBRQKRb1xJycnk/yymGpeU7LFzIBt5mZm82Bm82BmXUII
xLyz97b3uvc1BH7Of82pL8mngn/++eeIjo5G8+bNceDAAW0/S1lZGRYsWKD3PHK5HKGhocjOztaOaTQaZGdn62xT3apv3744fvw4NBqNduzXX3+Fr69vg4UNERGRubCZ2HpILm7mz5+PjIwMrFq1SqeK6tu3L/bv3y9pLqVSiVWrVmHDhg04cuQIJkyYgIqKCu3ZU2PGjNFpOJ4wYQKuXLmCKVOm4Ndff8X27duxYMECTJo0SeqPQUREZDK8MrFlSV4vO3bsGB555JF6461atcLVq1clzTVq1ChcvHgRSUlJKCkpQXBwMDIzM+Ht7Q0AKC4u1l4kEKi99cM333yDqVOnolevXvDz88OUKVPw+uuvS/0xiIiITIZ1jWVJLm58fHxw/PjxeteX2b17Nzp06CA5QHx8POLj4xt8Licnp95YREQE9u7dW/9gIiIiIhiwLTV+/HhMmTIF+fn5kMlkOHfuHD766CNMmzYNEyZMMEVGIiIiIr1JXrmZPn06NBoNBgwYgMrKSjzyyCNQKBSYNm0a/v3vf5siIxERkdXjPaWsh+TiRiaTYebMmXjttddw/PhxXL9+Hd27d0fLli1NkY+IiMjq8Z5S1sXgE/Dlcjm6d+9uzCxEREQ2iaeBWxfJxU1FRQUWLlyI7OxsXLhwQeeaMwBw8uRJo4UjIiKyNTwN3PIkFzfjxo1Dbm4uRo8eDV9fX/4DEhER3YJ/Fi1PcnHz3//+F9u3b0ffvn1NkYeIiMjmsJnYukg+FdzDwwOtW7c2RRYiIiKbw2Zi6yO5uJk3bx6SkpJQWVlpijxEREQ2hc3E1kevbamQkBCd3prjx4/D29sbAQEB9e7SKfX+UkRERPaCzcTWQa/iZsSIESaOQUREZPtY11gHvYqb5ORkU+cgIiKySWwmtj6Se246dOiAy5cv1xu/evWqQTfOJCIislVsJrZOkoub06dPo6ampt64SqXCH3/8YZRQREREtoDNxNZJ7+vc/Oc//9F+/80336BVq1baxzU1NcjOzkZgYKBx0xEREdkINhNbD72Lm7qmYplMhtjYWJ3nnJycEBAQgCVLlhg1HBERka1gXWM99C5u6u4hFRgYiH379sHT09NkoYiIiGwBm4mtk+TbL5w6dcoUOYiIiGwKm4mtl+SGYiIiImIzsTVjcUNERHSP2ExsXVjcEBER3SPWNdaFxQ0REZEB2ExsvSQXN2PGjMG6detw4sQJU+QhIiKyemwmtm6Sixu5XI7U1FR07twZ/v7+eOGFF7B69Wr89ttvpshHRERkddhMbN0kFzerV6/Gr7/+ijNnzmDRokVo2bIllixZgq5du+K+++4zRUYiIiKrxWZi62Nwz42HhwfatGkDDw8PuLu7o1mzZmjbtq0xsxEREVk91jXWR3JxM2PGDPTp0wdt2rTB9OnTcfPmTUyfPh0lJSU4cOCAQSFWrlyJgIAAODs7Izw8HAUFBXc8dv369ZDJZDpfzs7OBr0vERER2R/JVyheuHAh2rZti+TkZDz11FPo0qXLPQXYsmULlEolMjIyEB4ejvT0dERHR+PYsWPw8vJq8DVubm44duyY9jGXA4mIiKiO5JWbAwcOYObMmSgoKEDfvn3h5+eHmJgYvP/++/j1118lB0hLS8P48eMRFxeH7t27IyMjAy4uLli7du0dXyOTyeDj46P98vb2lvy+REREZJ8kr9wEBQUhKCgIkydPBgD8/PPPWLp0KSZNmgSNRoOamhq956qqqkJhYSESExO1Yw4ODoiKikJe3p1Psbt+/Trat28PjUaDhx56CAsWLMCDDz7Y4LEqlQoqlUr7uLy8trtdrVZDrVbrnfVu6uYy5pymZouZAdvMzczmwczmwcxAVVW1ztxqmfEvesPPueG59SETQtpliIQQOHDgAHJycpCTk4Pdu3ejvLwcvXr1QmRkJJYuXar3XOfOnYOfnx/27NmDiIgI7XhCQgJyc3ORn59f7zV5eXn47bff0KtXL5SVlWHx4sX4
/vvvcejQoQbP1pozZw5SUlLqjW/atAkuLi56ZyUiIgJqL9731v8ccbaytiViUe9qKHgmuMlVVlYiJiYGZWVlcHNza/RYycWNh4cHrl+/jqCgIERGRqJ///7o168f3N3dJQc1pLi5nVqtRrdu3fDcc89h3rx59Z5vaOXG398fly5duuuHI4VarUZWVhYGDhwIJycno81rSraYGbDN3MxsHsxsHk09c2VVNYLm7QIAdPNxxZcT/88kvZ9N/XO+XXl5OTw9PfUqbiRvS3344Yfo16+fUQoDT09PODo6orS0VGe8tLQUPj4+es3h5OSEkJAQHD9+vMHnFQoFFApFg68zxS+LqeY1JVvMDNhmbmY2D2Y2j6aa2Un8Vch8NqEP5HLJf0qlvV8T/ZwbmlNfkhuKhwwZoi1s/vjjD/zxxx9Sp9CSy+UIDQ1Fdna2dkyj0SA7O1tnJacxNTU1OHjwIHx9fQ3OQUREZAierGudJBc3Go0Gc+fORatWrdC+fXu0b98e7u7umDdvHjQajeQASqUSq1atwoYNG3DkyBFMmDABFRUViIuLA1B7L6tbG47nzp2LnTt34uTJk9i/fz9eeOEF/P777xg3bpzk9yYiIiL7I3ktbebMmVizZg0WLlyIvn37AgB2796NOXPm4ObNm3jjjTckzTdq1ChcvHgRSUlJKCkpQXBwMDIzM7WndxcXF8PB4a8a7M8//8T48eNRUlICDw8PhIaGYs+ePejevbvUH4WIiEgy3g3c+kkubjZs2IDVq1dj2LBh2rFevXrBz88PEydOlFzcAEB8fDzi4+MbfC4nJ0fn8dKlSyWdkUVERGQsvBu4bZC8LXXlyhV07dq13njXrl1x5coVo4QiIiKyRrwbuG2QXNwEBQVhxYoV9cZXrFiBoKAgo4QiIiKydrwbuPWSvC21aNEiDBkyBN9++632jKa8vDycOXMGO3bsMHpAIiIia8S6xnpJXrmJjIzEr7/+iieffBJXr17F1atX8dRTT+HYsWPo16+fKTISERFZBTYT2waDrjzUrl07gxqHiYiIbBWbiW2HXsXN//73P70n7NWrl8FhiIiIrBWbiW2HXsVNcHAwZDIZ7nYbKplMJumu4ERERLaIzcTWTa/i5tSpU6bOQUREZDNY11g3vYqb9u3bmzoHERGRVWMzse2QfLYUAGzcuBF9+/ZFu3bt8PvvvwMA0tPT8eWXXxo1HBERkTVgM7FtkVzcvPvuu1AqlRg8eDCuXr2q7bFxd3dHenq6sfMRERFZHJuJbYvk4mb58uVYtWoVZs6cCUfHv/5xw8LCcPDgQaOGIyIisjZsJrZ+koubU6dOISQkpN64QqFARUWFUUIRERFZK9Y11k9ycRMYGIiioqJ645mZmejWrZsxMhEREVkVNhPbFslXKFYqlZg0aRJu3rwJIQQKCgrw8ccfIzU1FatXrzZFRiIiIothM7HtkVzcjBs3Ds2bN8esWbNQWVmJmJgYtGvXDsuWLcOzzz5rioxEREQWw2Zi22PQvaWef/55PP/886isrMT169fh5eVl7FxERERWh83EtsGg4gYALly4gGPHjgGove1C27ZtjRaKiIjIGrGusQ2SG4qvXbuG0aNHo127doiMjERkZCTatWuHF154AWVlZabISERERKQ3ycXNuHHjkJ+fj+3bt+Pq1au4evUqvv76a/z000/417/+ZYqMRERERHqTvC319ddf45tvvsHDDz+sHYuOjsaqVavw+OOPGzUcERERkVSSV27atGmDVq1a1Rtv1aoVPDw8jBKKiIjIWvAaN7ZHcnEza9YsKJVKlJSUaMdKSkrw2muvYfbs2UYNR0REZEm8xo1t0mtbKiQkROfUt99++w33338/7r//fgBAcXExFAoFLl68yL4bIiKyG7zGjW3Sq7gZMWKEiWMQERFZN17jxnboVdwkJyebOgcREZFVY11jOyT33JjCypUrERAQAGdnZ4SHh6OgoECv123evBkymYwrS0REZBJs
JrZNFi9utmzZAqVSieTkZOzfvx9BQUGIjo7GhQsXGn3d6dOnMW3aNPTr189MSYmIqClhM7Htsnhxk5aWhvHjxyMuLg7du3dHRkYGXFxcsHbt2ju+pqamBs8//zxSUlLQoUMHM6YlIqKmgs3Etsvge0sZQ1VVFQoLC5GYmKgdc3BwQFRUFPLy7lwtz507F15eXnjppZfwww8/NPoeKpUKKpVK+7i8vPYXVa1WQ61W3+NP8Je6uYw5p6nZYmbANnMzs3kws3k0lcxqdbX2+00vhaG6urqRo42vqXzOUufWh0wIaTuKc+fOxbRp0+Di4qIzfuPGDbz11ltISkrSe65z587Bz88Pe/bsQUREhHY8ISEBubm5yM/Pr/ea3bt349lnn0VRURE8PT3x4osv4urVq9i2bVuD7zFnzhykpKTUG9+0aVO9n4GIiKiOqgZIKKhdA1jUuxoKLtxYVGVlJWJiYlBWVgY3N7dGj5Vc3Dg6OuL8+fPw8vLSGb98+TK8vLxQU1Oj91xSi5tr166hV69eeOedd/DEE08AwF2Lm4ZWbvz9/XHp0qW7fjhSqNVqZGVlYeDAgXBycjLavKZki5kB28zNzObBzObRVDJXqKoRPH8XAODn2Y/BRW7ezY6m8jnrq7y8HJ6ennoVN5L/pYQQDZ7n//PPP6N169aS5vL09ISjoyNKS0t1xktLS+Hj41Pv+BMnTuD06dMYOnSodkyj0QAAmjVrhmPHjqFjx446r1EoFFAoFPXmcnJyMskvi6nmNSVbzAzYZm5mNg9mNg97ziyEQMw7e297nWU6Oez5c5Y6p770/pfy8PCATCaDTCZDly5ddAqcmpoaXL9+HS+//LKkoHK5HKGhocjOztaezq3RaJCdnY34+Ph6x3ft2hUHDx7UGZs1axauXbuGZcuWwd/fX9L7ExERNYTNxLZN7+ImPT0dQgiMHTsWKSkpOjfPlMvlCAgI0Nla0pdSqURsbCzCwsLQu3dvpKeno6KiAnFxcQCAMWPGwM/PD6mpqXB2dkaPHj10Xu/u7g4A9caJiIiMgVcmtj16FzexsbEAgMDAQPTt2xfNmhlneW7UqFG4ePEikpKSUFJSguDgYGRmZsLb2xtA7X2rHBwsfsY6ERE1UaxrbI/kCiUyMhInTpzAunXrcOLECSxbtgxeXl7473//i/vvvx8PPvig5BDx8fENbkMBQE5OTqOvXb9+veT3IyIiagyvTGzbJC+J5ObmomfPnsjPz8fWrVtx/fp1ALUNxbwHFRER2Tpemdj2SS5upk+fjvnz5yMrKwtyuVw7/thjj2Hv3r2NvJKIiMj6sZnY9kkubg4ePIgnn3yy3riXlxcuXbpklFBERETWgM3EtklycePu7o7z58/XGz9w4AD8/PyMEoqIiMgasK6xTZKLm2effRavv/46SkpKIJPJoNFo8OOPP2LatGkYM2aMKTISERGZDZuJbZ/k4mbBggXo2rUr/P39cf36dXTv3h2PPPII+vTpg1mzZpkiIxERkVmwmdg+SD4VXC6XY9WqVUhKSsLBgwdx/fp1hISEoHPnzqbIR0REZDZsJrYPBl+Jz9/fH/7+/qipqcHBgwfx559/wsPDw5jZiIiILIbNxLZL8rbUK6+8gjVr1gCovadUZGQkHnroIfj7+9/1gntERES2gnWN7ZJc3Hz22WcICgoCAHz11Vc4efIkjh49iqlTp2LmzJlGD0hERGQubCa2D5KLm0uXLsHHxwcAsGPHDowcORJdunTB2LFj692xm4iIyFawmdh+SC5uvL29cfjwYdTU1CAzMxMDBw4EAFRWVsLRkY1XRERkm9hMbD8kNxTHxcVh5MiR8PX1hUwmQ1RUFAAgPz8fXbt2NXpAIiIic2MzsW2TXNzMmTMHPXr0wJkzZ/DMM89AoVAAABwdHTF9+nSjByQiIjI31jW2zaBTwf/xj3/UG4uNjb3nMERERET3yqDipqKiArm5uSguLkZVVZXOc5MnTzZKMCIiIiJD
SC5uDhw4gMGDB6OyshIVFRVo3bo1Ll26BBcXF3h5ebG4ISIim8TTwO2H5LOlpk6diqFDh+LPP/9E8+bNsXfvXvz+++8IDQ3F4sWLTZGRiIjIpHgauH2RXNwUFRXh1VdfhYODAxwdHaFSqeDv749FixZhxowZpshIRERkUjwN3L5ILm6cnJzg4FD7Mi8vLxQXFwMAWrVqhTNnzhg3HRERkZnxNHDbJ7nnJiQkBPv27UPnzp0RGRmJpKQkXLp0CRs3bkSPHj1MkZGIiMhsWNfYPskrNwsWLICvry8A4I033oCHhwcmTJiAixcv4r333jN6QCIiIlNjM7F9kbxyExYWpv3ey8sLmZmZRg1ERERkTmwmtj+SV24ee+wxXL16td54eXk5HnvsMWNkIiIiMhs2E9sfycVNTk5OvQv3AcDNmzfxww8/GCUUERGRJbCZ2D7ovS31v//9T/v94cOHUVJSon1cd4dwPz8/46YjIiIysVv7bVjX2Ae9V26Cg4MREhICmUyGxx57DMHBwdqv0NBQzJ8/H0lJSQaFWLlyJQICAuDs7Izw8HAUFBTc8ditW7ciLCwM7u7uaNGiBYKDg7Fx40aD3peIiJo29tvYJ71Xbk6dOgUhBDp06ICCggK0bdtW+5xcLoeXlxccHaXvU27ZsgVKpRIZGRkIDw9Heno6oqOjcezYMXh5edU7vnXr1pg5cya6du0KuVyOr7/+GnFxcfDy8kJ0dLTk9ycioqaL/Tb2Se/ipn379gAAjUZj1ABpaWkYP3484uLiAAAZGRnYvn071q5di+nTp9c7vn///jqPp0yZgg0bNmD37t0sboiIyGDst7EfBt0VHKjtu2noruDDhg3Te46qqioUFhYiMTFRO+bg4ICoqCjk5d19mVAIgV27duHYsWN48803GzxGpVJBpVJpH5eX11boarUaarVa76x3UzeXMec0NVvMDNhmbmY2D2Y2D3vKrFZXa7+vrlZD7WA9F7yxp8/ZmHPrQyaEtEsXnTx5Ek8++SQOHjwImUyGupfXVbs1NTV6z3Xu3Dn4+flhz549iIiI0I4nJCQgNzcX+fn5Db6urKwMfn5+UKlUcHR0xDvvvIOxY8c2eOycOXOQkpJSb3zTpk1wcXHROysREdkfVQ2QUFD73/mLeldDwV0pq1VZWYmYmBiUlZXBzc2t0WMlr9xMmTIFgYGByM7ORmBgIAoKCnD58mW8+uqrZrsruKurK4qKinD9+nVkZ2dDqVSiQ4cO9basACAxMRFKpVL7uLy8HP7+/hg0aNBdPxwp1Go1srKyMHDgQDg5ORltXlOyxcyAbeZmZvNgZvOwl8xCCAx/Zy+AawCA6OhBcJEbvKFhdPbyORtL3c6LPiT/K+bl5WHXrl3w9PSEg4MDHBwc8PDDDyM1NRWTJ0/GgQMH9J7L09MTjo6OKC0t1RkvLS2Fj4/PHV/n4OCATp06Aag9i+vIkSNITU1tsLhRKBRQKBT1xp2cnEzyy2KqeU3JFjMDtpmbmc2Dmc3D1jNXVlXjSEltYdPd1w1uLs5W2XNj65+zMefUl+SL+NXU1MDV1RVAbXFy7tw5ALUNx8eOHZM0l1wuR2hoKLKzs7VjGo0G2dnZOttUd6PRaHT6aoiIiKRgM7F9kbxy06NHD/z8888IDAxEeHg4Fi1aBLlcjvfffx8dOnSQHECpVCI2NhZhYWHo3bs30tPTUVFRoT17asyYMfDz80NqaioAIDU1FWFhYejYsSNUKhV27NiBjRs34t1335X83kRE1HTx4n32S3JxM2vWLFRUVAAA5s6di7///e/o168f2rRpgy1btkgOMGrUKFy8eBFJSUkoKSlBcHAwMjMz4e3tDQAoLi6Gg8NfC0wVFRWYOHEi/vjjDzRv3hxdu3bFhx9+iFGjRkl+byIiapp48T77Jrm4ufVaMp06dcLRo0dx5coVeHh4GLykFx8fj/j4+Aafy8nJ0Xk8f/58zJ8/36D3ISIiAnjxPntnlLbw1q1b
G2MaIiIis2O/jf3Rq7h56qmn9J5w69atBochIiIyN9Y19kevs6VatWql/XJzc0N2djZ++ukn7fOFhYXIzs5Gq1atTBaUiIiISB96rdysW7dO+/3rr7+OkSNHIiMjQ3ujzJqaGkycONGoF8UjIiIiMoTk69ysXbsW06ZN07kDuKOjI5RKJdauXWvUcERERKYg7cZDZGskFzfV1dU4evRovfGjR48a/Y7hRERExsbTwO2f5LOl4uLi8NJLL+HEiRPo3bs3ACA/Px8LFy7UXniPiIjIWvE0cPsnubhZvHgxfHx8sGTJEpw/fx4A4Ovri9deew2vvvqq0QMSERGZCk8Dt0+SixsHBwckJCQgISFBe4dONhITEZGt4G0X7N89XcSPRQ0REdmS2n6bvZaOQSYmuaGYiIjIVrHfpmlgcUNERE0S+23sF4sbIiJqMthv0zToVdy0bt0aly5dAgCMHTsW165dM2koIiIiYxMCeG71PkvHIDPQq7ipqqrSnhm1YcMG3Lx506ShiIiIjK1KAxwpqf2Pc/bb2De9zpaKiIjAiBEjEBoaCiEEJk+ejObNmzd4LG/BQERE1o79NvZNr+Lmww8/xNKlS3HixAnIZDKUlZVx9YaIiGwW6xr7pldx4+3tjYULFwIAAgMDsXHjRrRp08akwYiIiIxFCIFlv3AbqqmQfBG/U6dOmSIHERGRydxQ1+BsZe1yDftt7J9Bp4Ln5uZi6NCh6NSpEzp16oRhw4bhhx9+MHY2IiIio2O/jf2TXNx8+OGHiIqKgouLCyZPnqxtLh4wYAA2bdpkioxERET3hNe3aVokb0u98cYbWLRoEaZOnaodmzx5MtLS0jBv3jzExMQYNSAREdG9EELw+jZNjOSVm5MnT2Lo0KH1xocNG8Z+HCIisjo31DXa69t083Flv00TILm48ff3R3Z2dr3xb7/9Fv7+/kYJRUREZAofj/sb+22aAMnbUq+++iomT56MoqIi9OnTBwDw448/Yv369Vi2bJnRAxIRERkL65qmQXJxM2HCBPj4+GDJkiX45JNPAADdunXDli1bMHz4cKMHJCIiuhe3NhNT02DQqeBPPvkkdu/ejcuXL+Py5cvYvXv3PRU2K1euREBAAJydnREeHo6CgoI7Hrtq1Sr069cPHh4e8PDwQFRUVKPHExFR0yWEwDMZeZaOQWZmUHFjTFu2bIFSqURycjL279+PoKAgREdH48KFCw0en5OTg+eeew7fffcd8vLy4O/vj0GDBuHs2bNmTk5ERNbuhroGh8/X3vjZz0WwmbiJsHhxk5aWhvHjxyMuLg7du3dHRkYGXFxc7ngDzo8++ggTJ05EcHAwunbtitWrV0Oj0TTY5ExERFRnSo8aNhM3EZJ7boypqqoKhYWFSExM1I45ODggKioKeXn6LSNWVlZCrVajdevWDT6vUqmgUqm0j8vLayt4tVoNtVp9D+l11c1lzDlNzRYzA7aZm5nNg5nNw5YyV1VV6zy2hcx1bOlzrmPKzFLmlAlhuVarc+fOwc/PD3v27EFERIR2PCEhAbm5ucjPz7/rHBMnTsQ333yDQ4cOwdnZud7zc+bMQUpKSr3xTZs2wcXF5d5+ACIislpCAG/9z1F7T6lFvauh4K6UzaqsrERMTAzKysrg5ubW6LH3tHJTVxdZaplv4cKF2Lx5M3JychosbAAgMTERSqVS+7i8vFzbp3O3D0cKtVqNrKwsDBw4EE5OTkab15RsMTNgm7mZ2TyY2TxsJXNlVTVe2bsLANDVpyXkDletPvOtbOVzvpUpM9ftvOjDoOLmgw8+wFtvvYXffvsNANClSxe89tprGD16tKR5PD094ejoiNLSUp3x0tJS+Pj4NPraxYsXY+HChfj222/Rq1evOx6nUCigUCjqjTs5OZnkl8VU85qSLWYGbDM3M5sHM5uHtWdupvnrP7w3j+uN3OydVp+5Icz815z6ktxQnJaWhgkTJmDw4MH45JNP8Mknn+Dxxx/Hyy+/
jKVLl0qaSy6XIzQ0VKcZuK45+NZtqtstWrQI8+bNQ2ZmJsLCwqT+CEREZOduPwWcfcRNi+SVm+XLl+Pdd9/FmDFjtGPDhg3Dgw8+iDlz5ujcUFMfSqUSsbGxCAsLQ+/evZGeno6KigrExcUBAMaMGQM/Pz+kpqYCAN58800kJSVh06ZNCAgIQElJCQCgZcuWaNmypdQfh4iI7NCtp4B393XjKeBNjOTi5vz589rbLtyqT58+OH/+vOQAo0aNwsWLF5GUlISSkhIEBwcjMzMT3t7eAIDi4mI4OPy1wPTuu++iqqoK//jHP3TmSU5Oxpw5cyS/PxER2Z9bT5X59OUIyGS8THFTIrm46dSpEz755BPMmDFDZ3zLli3o3LmzQSHi4+MRHx/f4HM5OTk6j0+fPm3QexARUdPALSmSXNykpKRg1KhR+P7779G3b18AtTfOzM7O1t5rioiIyFIa2pKqrq6+y6vInkhuKH766aeRn58PT09PbNu2Ddu2bYOnpycKCgrw5JNPmiIjERGR3upvSXHppqkx6FTw0NBQfPjhh8bOQkREdE+4JUWAnsVNeXm59oJ3d7uIjjEvjEdERCQFz5IiQM/ixsPDA+fPn4eXlxfc3d0bXOITQkAmk6GmpsboIYmIiKTillTTpVdxs2vXLu2NKb/77juTBiIiIjIG1jVNl17FTWRkpPb7wMBA+Pv716uGhRA4c+aMcdMRERFJYLlbQZM1kXy2VGBgIC5evFhv/MqVKwgMDDRKKCIiIqlubyampktycVPXW3O769ev3/HO3ERERKbGZmKqo/ep4EqlEgAgk8kwe/ZsuLi4aJ+rqalBfn4+goODjR6QiIhIH7y+DdXRu7g5cOAAgNqVm4MHD0Iul2ufk8vlCAoKwrRp04yfkIiI6C54fRu6ld7FTd1ZUnFxcVi2bBmvZ0NERFaDW1J0K8lXKF63bp0pchARERmMW1J0K4Nuv/DTTz/hk08+QXFxMaqqqnSe27p1q1GCERER6YNbUnQ7yWdLbd68GX369MGRI0fwxRdfQK1W49ChQ9i1axdatWplioxERER3VFnFLSnSJbm4WbBgAZYuXYqvvvoKcrkcy5Ytw9GjRzFy5Ejcf//9pshIRETUoNtXbbglRYABxc2JEycwZMgQALVnSVVUVEAmk2Hq1Kl4//33jR6QiIjoTm5vJHaRc9WGDChuPDw8cO3aNQCAn58ffvnlFwDA1atXUVlZadx0REREjWAjMTVEckPxI488gqysLPTs2RPPPPMMpkyZgl27diErKwsDBgwwRUYiIqJ62EhMdyK5uFmxYgVu3rwJAJg5cyacnJywZ88ePP3005g1a5bRAxIRETWE17ahO5Fc3LRu3Vr7vYODA6ZPn659fOPGDeOkIiIikoBbUnQryT03DVGpVEhLS+NdwYmIyGxu7bdhXUO30ru4UalUSExMRFhYGPr06YNt27YBqL1icWBgIJYuXYqpU6eaKicREZHW7f02RLfSe1sqKSkJ7733HqKiorBnzx4888wziIuLw969e5GWloZnnnkGjo7c7yQiItPjhfuoMXoXN59++ik++OADDBs2DL/88gt69eqF6upq/Pzzz9znJCIis+GF++hu9N6W+uOPPxAaGgoA6NGjBxQKBaZOncpfKCIiMqvbV2144T66nd7FTU1NDeRyufZxs2bN0LJly3sOsHLlSgQEBMDZ2Rnh4eEoKCi447GHDh3C008/jYCAAMhkMqSnp9/z+xMRke3gqg3pQ+9tKSEEXnzxRSgUCgDAzZs38fLLL6NFixY6x0m5K/iWLVugVCqRkZGB8PBwpKenIzo6GseOHYOXl1e94ysrK9GhQwc888wzbF4mImqCuGpD+tC7uImNjdV5/MILL9zzm6elpWH8+PGIi4sDAGRkZGD79u1Yu3atzvVz6vztb3/D3/72NwBo8HkiIrJfXLUhfeld3Kxbt86ob1xVVYXCwkIkJiZqxxwcHBAVFYW8POOd3qdS
qaBSqbSPy8trK361Wg21Wm2096mby5hzmpotZgZsMzczmwczm4elMleoqrWrNt18XOEk0+idgZ+zeZgys5Q5ZULcehkk8zl37hz8/PywZ88eREREaMcTEhKQm5uL/Pz8Rl8fEBCAV155Ba+88kqjx82ZMwcpKSn1xjdt2gQXFxeDshMRkXkJAbz1P0ecraxdqVnUuxoK7kg1KZWVlYiJiUFZWRnc3NwaPVby7RdsTWJiIpRKpfZxeXk5/P39MWjQoLt+OFKo1WpkZWVh4MCBcHJyMtq8pmSLmQHbzM3M5sHM5mGJzJVV1Xhl7y4Atas2I/7+f5K2pPg5m4cpM9ftvOjDYsWNp6cnHB0dUVpaqjNeWloKHx8fo72PQqHQNkHfysnJySS/LKaa15RsMTNgm7mZ2TyY2TzMmbmZ5q9C5rMJfSCXG/bni5+zeZgis5T5jHJvKUPI5XKEhoYiOztbO6bRaJCdna2zTUVERE3b7Y3E7CGmu7HotpRSqURsbCzCwsLQu3dvpKeno6KiQnv21JgxY+Dn54fU1FQAtU3Ihw8f1n5/9uxZFBUVoWXLlujUqZPFfg4iIjId3mqBpLJocTNq1ChcvHgRSUlJKCkpQXBwMDIzM+Ht7Q0AKC4uhoPDX4tL586dQ0hIiPbx4sWLsXjxYkRGRiInJ8fc8YmIyMR4+jcZwuINxfHx8YiPj2/wudsLloCAAFjo5C4iIrIAXrSPDGGxnhsiIqLGaDQCf1++W/uYqzakLxY3RERkdYSoLWxOXaoAwFUbkobFDRERWZ1bt6MCPVvg638/zFUb0huLGyIisiq3b0d9/e+H4eDAwob0x+KGiIisBrejyBhY3BARkdXgdhQZA4sbIiKyCrdf04bbUWQoFjdERGQVeE0bMhYWN0REZHG8pg0ZE4sbIiKyKI1GYEBaLpuIyWhY3BARkcXcXtiwiZiMgcUNERFZxO2nfQd6tkC2MpJNxHTPWNwQEZFF3H7aNwsbMhYWN0REZHa8CjGZEosbIiIyKzYQk6k1s3QAIiJqGoQQqKyqqddnwwZiMjYWN0REZHJCCPwjIw+Fv/+pHWOfDZkKt6WIiMikhBC4XFGlU9h093VjYUMmw5UbIiIymbrG4bqzogDgp1lRaNNCzq0oMhkWN0REZHQN9dcAQFh7DxY2ZHIsboiIyKgaWq2paxx2kTuysCGTY3FDRET3TAiBG+oaCIF6qzXdfd14HRsyKxY3RERksLrtp2cy8nRWagCu1pDlsLghIiJJhBBQ1QAVqmrEvLO3XlEDcLWGLIvFDRER3dWt207/eHcvjpQ0Q0LBLp1juvu64dOXIyCTAc2duFpDlsPihoiIdNQVMn89RoPbTnXqihpuP5G1sIriZuXKlXjrrbdQUlKCoKAgLF++HL17977j8Z9++ilmz56N06dPo3PnznjzzTcxePBgMyYmIrJttxcwf403XsjU8XMR2K4cALncias0ZHUsXtxs2bIFSqUSGRkZCA8PR3p6OqKjo3Hs2DF4eXnVO37Pnj147rnnkJqair///e/YtGkTRowYgf3796NHjx4W+AmIiEzvTsWIYXPpV8Dcrm6Fprpaje+ydqKFohmcnCz+Z4SoHov/VqalpWH8+PGIi4sDAGRkZGD79u1Yu3Ytpk+fXu/4ZcuW4fHHH8drr70GAJg3bx6ysrKwYsUKZGRkmDX7reoa7CqrquEkbOO/YNTqapvLDNhmbmY2D3vNbGgxci9u7Z+pU7dCo3YQ4EINWTOLFjdVVVUoLCxEYmKidszBwQFRUVHIy8tr8DV5eXlQKpU6Y9HR0di2bVuDx6tUKqhUKu3jsrIyAMCVK1egVqvv8Sf4S3nFTUz7oQrTfvjaaHOaiy1mBmwzNzObBzPr5wHvllgz5qEGC5XmTo64eb1MZ+zG//+/arUalZWVuHz5MpycnEwf1AiY2TxMmfnatWsAahcT7saixc2lS5dQU1MDb29vnXFvb28cPXq0wdeUlJQ0eHxJ
SUmDx6empiIlJaXeeGBgoIGpiYjswxkA7RPvehiRVbl27RpatWrV6DEW35YytcTERJ2VHo1GgytXrqBNmzZGbYArLy+Hv78/zpw5Azc3N6PNa0q2mBmwzdzMbB7MbB7MbB7MrEsIgWvXrqFdu3Z3PdaixY2npyccHR1RWlqqM15aWgofH58GX+Pj4yPpeIVCAYVCoTPm7u5ueOi7cHNzs5lfwjq2mBmwzdzMbB7MbB7MbB7M/Je7rdjUcTD6O0sgl8sRGhqK7Oxs7ZhGo0F2djYiIiIafE1ERITO8QCQlZV1x+OJiIioabH4tpRSqURsbCzCwsLQu3dvpKeno6KiQnv21JgxY+Dn54fU1FQAwJQpUxAZGYklS5ZgyJAh2Lx5M3766Se8//77lvwxiIiIyEpYvLgZNWoULl68iKSkJJSUlCA4OBiZmZnapuHi4mI4OPy1wNSnTx9s2rQJs2bNwowZM9C5c2ds27bN4te4USgUSE5OrrcFZs1sMTNgm7mZ2TyY2TyY2TyY2XAyoc85VUREREQ2wqI9N0RERETGxuKGiIiI7AqLGyIiIrIrLG6IiIjIrrC4kWDlypUICAiAs7MzwsPDUVBQ0Ojxn376Kbp27QpnZ2f07NkTO3bsMFPSv0jJfOjQITz99NMICAiATCZDenq6+YLeQkrmVatWoV+/fvDw8ICHhweioqLu+u9iKlJyb926FWFhYXB3d0eLFi0QHByMjRs3mjFtLam/03U2b94MmUyGESNGmDZgA6RkXr9+PWQymc6Xs7OzGdPWkvo5X716FZMmTYKvry8UCgW6dOli9v/9kJK5f//+9T5nmUyGIUOGmDGx9M85PT0dDzzwAJo3bw5/f39MnToVN2/eNFPaWlIyq9VqzJ07Fx07doSzszOCgoKQmZlpxrTA999/j6FDh6Jdu3aQyWR3vK/jrXJycvDQQw9BoVCgU6dOWL9+vclzQpBeNm/eLORyuVi7dq04dOiQGD9+vHB3dxelpaUNHv/jjz8KR0dHsWjRInH48GExa9Ys4eTkJA4ePGi1mQsKCsS0adPExx9/LHx8fMTSpUvNlrWO1MwxMTFi5cqV4sCBA+LIkSPixRdfFK1atRJ//PGHVef+7rvvxNatW8Xhw4fF8ePHRXp6unB0dBSZmZlWm7nOqVOnhJ+fn+jXr58YPny4ecL+f1Izr1u3Tri5uYnz589rv0pKSqw6s0qlEmFhYWLw4MFi9+7d4tSpUyInJ0cUFRVZbebLly/rfMa//PKLcHR0FOvWrbPazB999JFQKBTio48+EqdOnRLffPON8PX1FVOnTrXazAkJCaJdu3Zi+/bt4sSJE+Kdd94Rzs7OYv/+/WbLvGPHDjFz5kyxdetWAUB88cUXjR5/8uRJ4eLiIpRKpTh8+LBYvny5Wf63jsWNnnr37i0mTZqkfVxTUyPatWsnUlNTGzx+5MiRYsiQITpj4eHh4l//+pdJc95KauZbtW/f3iLFzb1kFkKI6upq4erqKjZs2GCqiA2619xCCBESEiJmzZplingNMiRzdXW16NOnj1i9erWIjY01e3EjNfO6detEq1atzJSuYVIzv/vuu6JDhw6iqqrKXBHrudff56VLlwpXV1dx/fp1U0WsR2rmSZMmiccee0xnTKlUir59+5o0562kZvb19RUrVqzQGXvqqafE888/b9Kcd6JPcZOQkCAefPBBnbFRo0aJ6OhoEyYTgttSeqiqqkJhYSGioqK0Yw4ODoiKikJeXl6Dr8nLy9M5HgCio6PveLyxGZLZ0oyRubKyEmq1Gq1btzZVzHruNbcQAtnZ2Th27BgeeeQRU0bVMjTz3Llz4eXlhZdeeskcMXUYmvn69eto3749/P39MXz4cBw6dMgccQEYlvk///kPIiIiMGnSJHh7e6NHjx5YsGABampqrDbz7dasWYNnn30WLVq0MFVMHYZk7tOnDwoLC7XbQCdPnsSOHTswePBgq82sUqnqbas2b94cu3fv
NmnWe2Gpv4UsbvRw6dIl1NTUaK+aXMfb2xslJSUNvqakpETS8cZmSGZLM0bm119/He3atav3/0ymZGjusrIytGzZEnK5HEOGDMHy5csxcOBAU8cFYFjm3bt3Y82aNVi1apU5ItZjSOYHHngAa9euxZdffokPP/wQGo0Gffr0wR9//GGOyAZlPnnyJD777DPU1NRgx44dmD17NpYsWYL58+ebI/I9//9hQUEBfvnlF4wbN85UEesxJHNMTAzmzp2Lhx9+GE5OTujYsSP69++PGTNmmCOyQZmjo6ORlpaG3377DRqNBllZWdi6dSvOnz9vjsgGudPfwvLycty4ccNk78vihuzGwoULsXnzZnzxxRcWaRqVytXVFUVFRdi3bx/eeOMNKJVK5OTkWDpWg65du4bRo0dj1apV8PT0tHQcvUVERGDMmDEIDg5GZGQktm7dirZt2+K9996zdLQ70mg08PLywvvvv4/Q0FCMGjUKM2fOREZGhqWj6WXNmjXo2bMnevfubekojcrJycGCBQvwzjvvYP/+/di6dSu2b9+OefPmWTraHS1btgydO3dG165dIZfLER8fj7i4OJ1bFFEti99byhZ4enrC0dERpaWlOuOlpaXw8fFp8DU+Pj6Sjjc2QzJb2r1kXrx4MRYuXIhvv/0WvXr1MmXMegzN7eDggE6dOgEAgoODceTIEaSmpqJ///6mjAtAeuYTJ07g9OnTGDp0qHZMo9EAAJo1a4Zjx46hY8eOVpW5IU5OTggJCcHx48dNEbEeQzL7+vrCyckJjo6O2rFu3bqhpKQEVVVVkMvlVpe5TkVFBTZv3oy5c+eaMmI9hmSePXs2Ro8erV1h6tmzJyoqKvDPf/4TM2fONHnBYEjmtm3bYtu2bbh58yYuX76Mdu3aYfr06ejQoYNJs96LO/0tdHNzQ/PmzU32viz39CCXyxEaGors7GztmEajQXZ2NiIiIhp8TUREhM7xAJCVlXXH443NkMyWZmjmRYsWYd68ecjMzERYWJg5ouow1met0WigUqlMEbEeqZm7du2KgwcPoqioSPs1bNgwPProoygqKoK/v7/VZW5ITU0NDh48CF9fX1PF1GFI5r59++L48ePa4hEAfv31V/j6+pq8sDE0c51PP/0UKpUKL7zwgqlj6jAkc2VlZb0Cpq6gFGa45eK9fM7Ozs7w8/NDdXU1Pv/8cwwfPtzUcQ1msb+FJm1XtiObN28WCoVCrF+/Xhw+fFj885//FO7u7trTSkePHi2mT5+uPf7HH38UzZo1E4sXLxZHjhwRycnJFjkVXEpmlUolDhw4IA4cOCB8fX3FtGnTxIEDB8Rvv/1mtZkXLlwo5HK5+Oyzz3RORb127ZrZMhuSe8GCBWLnzp3ixIkT4vDhw2Lx4sWiWbNmYtWqVVab+XaWOFtKauaUlBTxzTffiBMnTojCwkLx7LPPCmdnZ3Ho0CGrzVxcXCxcXV1FfHy8OHbsmPj666+Fl5eXmD9/vtVmrvPwww+LUaNGmS3nraRmTk5OFq6uruLjjz8WJ0+eFDt37hQdO3YUI0eOtNrMe/fuFZ9//rk4ceKE+P7778Vjjz0mAgMDxZ9//mm2zNeuXdP+nQAg0tLSxIEDB8Tvv/8uhBBi+vTpYvTo0drj604Ff+2118SRI0fEypUreSq4tVm+fLm4//77hVwuF7179xZ79+7VPhcZGSliY2N1jv/kk09Ely5dhFwuFw8++KDYvn27mRNLy3zq1CkBoN5XZGSk1WZu3759g5mTk5PNmllq7pkzZ4pOnToJZ2dn4eHhISIiIsTmzZutOvPtLFHcCCEt8yuvvKI91tvbWwwePNis1wQxJLMQQuzZs0eEh4cLhUIhOnToIN544w1RXV1t1ZmPHj0qAIidO3eaNeetpGRWq9Vizpw5omPHjsLZ2Vn4+/uLiRMnmrVQkJo5JydHdOvWTSgUCtGmTRsxevRocfbsWbPm/e677xr839y6nLGxsfX+Znz33Xci
ODhYyOVy0aFDB7Nc/0gmhBnW34iIiIjMhD03REREZFdY3BAREZFdYXFDREREdoXFDREREdkVFjdERERkV1jcEBERkV1hcUNERER2hcUNERnV+vXr4e7ubukYOH36NGQyGYqKiu5pnv79++OVV17RPg4ICEB6evo9zQkAL774IkaMGHHP8xBRfSxuiJqYkpIS/Pvf/0aHDh2gUCjg7++PoUOH1rv/i6FGjRqFX3/91ShzNebUqVOIiYlBu3bt4OzsjPvuuw/Dhw/H0aNHAQD+/v44f/48evTocU/vs3XrVpPcKXrZsmVYv3699vHtRRQRGY53BSdqQk6fPo2+ffvC3d0db731Fnr27Am1Wo1vvvkGkyZN0hYG96J58+YmvdsvAKjVagwcOBAPPPAAtm7dCl9fX/zxxx/473//i6tXrwKovQmivncLb0zr1q3veY5b1dTUQCaToVWrVkadl4huYfIbPBCR1XjiiSeEn5+fuH79er3nbr2nzu+//y6GDRsmWrRoIVxdXcUzzzyjvZmfEEIUFRWJ/v37i5YtWwpXV1fx0EMPiX379gkhhFi3bp1o1aqV9tjk5GQRFBQkPvjgA9G+fXvh5uYmRo0aJcrLy7XH1NTUiAULFoiAgADh7OwsevXqJT799NM7/hx1N+07ffr0HY+pu1fagQMHhBB/3RMnMzNTBAcHC2dnZ/Hoo4+K0tJSsWPHDtG1a1fh6uoqnnvuOVFRUaGdJzIyUkyZMkX7uH379mLp0qXax0uWLBE9evQQLi4u4r777hMTJkzQuXFr3efx5Zdfim7duglHR0dx6tQpnXtzxcbG1rtXz8mTJ0XHjh3FW2+91eDPbs4b2hLZGm5LETURV65cQWZmJiZNmoQWLVrUe76uT0aj0WD48OG4cuUKcnNzkZWVhZMnT2LUqFHaY59//nncd9992LdvHwoLCzF9+nQ4OTnd8b1PnDiBbdu24euvv8bXX3+N3NxcLFy4UPt8amoqPvjgA2RkZODQoUOYOnUqXnjhBeTm5jY4X9u2beHg4IDPPvsMNTU1kj6HOXPmYMWKFdizZw/OnDmDkSNHIj09HZs2bcL27duxc+dOLF++XO/5HBwc8Pbbb+PQoUPYsGEDdu3ahYSEBJ1jKisr8eabb2L16tU4dOgQvLy8dJ5ftmwZIiIiMH78eJw/fx7nz5/H/fffj7Fjx2LdunU6x65btw6PPPIIOnXqJOnnJmpSLF1dEZF55OfnCwBi69atjR63c+dO4ejoKIqLi7Vjhw4dEgBEQUGBEEIIV1dXsX79+gZf39DKjYuLi85KzWuvvSbCw8OFEELcvHlTuLi4iD179ujM89JLL4nnnnvujjlXrFghXFxchKurq3j00UfF3LlzxYkTJ7TP32nl5ttvv9Uek5qaKgDovO5f//qXiI6O1j6+28rN7T799FPRpk0bnc8DgCgqKtI57va7qt/+PkIIcfbsWeHo6Cjy8/OFEEJUVVUJT0/PO372RFSLKzdETYQQQq/jjhw5An9/f/j7+2vHunfvDnd3dxw5cgQAoFQqMW7cOERFRWHhwoU4ceJEo3MGBATA1dVV+9jX1xcXLlwAABw/fhyVlZUYOHAgWrZsqf364IMPGp130qRJKCkpwUcffYSIiAh8+umnePDBB5GVldVoll69emm/9/b2houLCzp06KAzVpdNH99++y0GDBgAPz8/uLq6YvTo0bh8+TIqKyu1x8jlcp331Ve7du0wZMgQrF27FgDw1VdfQaVS4ZlnnpE8F1FTwuKGqIno3LkzZDKZUZqG58yZg0OHDmHIkCHYtWsXunfvji+++OKOx9++ZSWTyaDRaAAA169fBwBs374dRUVF2q/Dhw/js88+azSHq6srhg4dijfeeAM///wz+vXrh/nz5zf6mluzyGSyRrPdzenTp/H3v/8dvXr1wueff47CwkKsXLkSAFBVVaU9rnnz5pDJZHrNebtx48Zh8+bNuHHjBtatW4dRo0bBxcXF
oLmImgoWN0RNROvWrREdHY2VK1eioqKi3vN1Zxl169YNZ86cwZkzZ7TPHT58GFevXkX37t21Y126dMHUqVOxc+dOPPXUU/V6Q/TVvXt3KBQKFBcXo1OnTjpft64e3Y1MJkPXrl0b/NlMpbCwEBqNBkuWLMH//d//oUuXLjh37pxBc8nl8gb7hwYPHowWLVrg3XffRWZmJsaOHXuvsYnsHosboiZk5cqVqKmpQe/evfH555/jt99+w5EjR/D2228jIiICABAVFYWePXvi+eefx/79+1FQUIAxY8YgMjISYWFhuHHjBuLj45GTk4Pff/8dP/74I/bt24du3boZlMnV1RXTpk3D1KlTsWHDBpw4cQL79+/H8uXLsWHDhgZfU1RUhOHDh+Ozzz7D4cOHcfz4caxZswZr167F8OHDDf58pOrUqRPUajWWL1+OkydPYuPGjcjIyDBoroCAAOTn5+P06dO4dOmSdvXI0dERL774IhITE9G5c2ftvxMR3RmLG6ImpEOHDti/fz8effRRvPrqq+jRowcGDhyI7OxsvPvuuwBqV0C+/PJLeHh44JFHHkFUVBQ6dOiALVu2AKj9Y3v58mWMGTMGXbp0wciRI/HEE08gJSXF4Fzz5s3D7NmzkZqaim7duuHxxx/H9u3bERgY2ODx9913HwICApCSkoLw8HA89NBDWLZsGVJSUjBz5kyDc0gVFBSEtLQ0vPnmm+jRowc++ugjpKamGjTXtGnT4OjoiO7du6Nt27YoLi7WPvfSSy+hqqoKcXFxxopOZNdkQt8uQyIisogffvgBAwYMwJkzZ+Dt7W3pOERWj8UNEZGVUqlUuHjxImJjY+Hj44OPPvrI0pGIbAK3pYiIrNTHH3+M9u3b4+rVq1i0aJGl4xDZDK7cEBERkV3hyg0RERHZFRY3REREZFdY3BAREZFdYXFDREREdoXFDREREdkVFjdERERkV1jcEBERkV1hcUNERER2hcUNERER2ZX/B7e2w9rkroZEAAAAAElFTkSuQmCC",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.ecdf(x=similarity_across_dataset.keys(), weights=similarity_across_dataset.values())\n",
    "plt.xticks(np.linspace(0, 1, 11))\n",
    "plt.yticks(np.linspace(0, 1, 11))\n",
    "plt.xlabel(\"Cosine Similarity\")\n",
    "plt.ylabel(\"Ratio of dataset below the similarity score\")\n",
    "plt.grid()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90340b55",
   "metadata": {},
   "source": [
    "## Identify Duplicates\n",
    "\n",
    "We will create a simple pipeline that now identifies duplicates and writes them out.\n",
    "\n",
    "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.semantic.identify_duplicates.html#stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage) for more information about the `IdentifyDuplicatesStage` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "53e5b2ae",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-16 13:49:55,497\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:49:55,499\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:49:55,506\tINFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
      "2025-09-16 13:49:55,520\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:49:55,522\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:49:55,522\tINFO worker.py:1789 -- Calling ray.init() again after it has already been called.\n"
     ]
    }
   ],
   "source": [
    "from nemo_curator.pipeline import Pipeline\n",
    "from nemo_curator.stages.deduplication.semantic import IdentifyDuplicatesStage\n",
    "from nemo_curator.stages.file_partitioning import FilePartitioningStage\n",
    "from nemo_curator.utils.file_utils import create_or_overwrite_dir\n",
    "\n",
    "duplicates_output_path = os.path.join(output_path, \"duplicates\")\n",
    "create_or_overwrite_dir(duplicates_output_path)\n",
    "\n",
    "identify_duplicates_pipeline = Pipeline(\n",
    "    name=\"identify_duplicates_pipeline\",\n",
    "    stages=[\n",
    "        FilePartitioningStage(\n",
    "            file_paths=pairwise_path,\n",
    "            # we select files per partition to be 1, because IdentifyDuplicates has default batch_size of 10\n",
    "            # this means it'll process 10 files at a time\n",
    "            files_per_partition=1,\n",
    "        ),\n",
    "        IdentifyDuplicatesStage(\n",
    "            output_path=duplicates_output_path,\n",
    "            eps=0.1,\n",
    "        ),\n",
    "    ],\n",
    ")\n",
    "\n",
    "identify_duplicates_out = identify_duplicates_pipeline.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b47ce381",
   "metadata": {},
   "source": [
    "#### Looking at Duplicates\n",
    "\n",
    "- `id` : This is a list of all IDs that are above our similarity threshold `eps`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "865a9273",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>268</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>273</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>280</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>323</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    id\n",
       "0  115\n",
       "1  268\n",
       "2  273\n",
       "3  280\n",
       "4  323"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_parquet(os.path.join(duplicates_output_path, os.listdir(duplicates_output_path)[0])).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88140f70",
   "metadata": {},
   "source": [
    "## Removing Duplicates\n",
    "\n",
    "We offer a simple `TextDuplicatesRemovalWorkflow` that can remove duplicates from a given input dataset and list of duplicates to remove.\n",
    "\n",
    "### Notes\n",
    "1. When running the removal workflow, we must specify the same input configuration as we did when we \"generated IDs\".\n",
    "2. In this tutorial that happened at the embedding generation step.\n",
    "3. Therefore it's required that we match the same arguments of filepath, filetype and `files_per_partition`/`blocksize`.\n",
    "4. This is required because IDs are generated by hashing the filenames in each task. If the filenames (and their partitioning) do not match exactly between steps, the ID Generator will not be able to find the correct IDs and will error out.\n",
    "\n",
    "### Performance\n",
    "If you notice OOMs during this stage, you can try using `RayDataActor`.\n",
    "\n",
    "### How `TextDuplicatesRemovalWorkflow` works\n",
    "1. It starts the ID Generator using `create_id_generator(filepath=...)`\n",
    "1. It runs a pipeline that does [`ParquetReader`, `TextDuplicatesRemovalStage`, `ParquetWriter`] (assuming input/output filetypes are Parquet)\n",
    "1. It kills the ID Generator using `kill_id_generator_actor`\n",
    "\n",
    "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.deduplication.semantic.html#stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow) for more information about the `TextSemanticDeduplicationWorkflow` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "185aa65d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-16 13:50:29,842\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:50:29,845\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:50:29,852\tINFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
      "2025-09-16 13:50:30,361\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:50:30,364\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:50:30,371\tINFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
      "2025-09-16 13:50:30,385\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:50:30,387\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:50:30,387\tINFO worker.py:1789 -- Calling ray.init() again after it has already been called.\n",
      "2025-09-16 13:50:44,900\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
      "2025-09-16 13:50:44,902\tINFO worker.py:1771 -- Connecting to existing Ray cluster at address: 127.0.1.1:6379...\n",
      "2025-09-16 13:50:44,909\tINFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n"
     ]
    }
   ],
   "source": [
    "from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow\n",
    "\n",
    "duplicates_output_path = os.path.join(output_path, \"duplicates\")\n",
    "\n",
    "# The workflow starts from a new IdGenerator from the persisted id generator\n",
    "# This helps it assign the same IDs back to the same file\n",
    "# It is important the we read again the same dataset using the same file path and files_per_partition / blocksize arguments\n",
    "id_generator_path = os.path.join(output_path, \"semantic_id_generator.json\")\n",
    "\n",
    "removal_workflow = TextDuplicatesRemovalWorkflow(\n",
    "    input_path=input_path,\n",
    "    ids_to_remove_path=duplicates_output_path,\n",
    "    output_path=os.path.join(output_path, \"deduplicated\"),\n",
    "    # input args\n",
    "    input_filetype=input_filetype,\n",
    "    input_fields=[\"text\"],\n",
    "    input_files_per_partition=1,\n",
    "    # output args\n",
    "    output_filetype=output_filetype,\n",
    "    output_fields=[\"text\", \"_curator_dedup_id\"],\n",
    "    # id args\n",
    "    ids_to_remove_duplicate_id_field=\"id\",  # this is the field that contains the IDs of the duplicates\n",
    "    id_generator_path=id_generator_path,\n",
    ")\n",
    "\n",
    "removal_out = removal_workflow.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a850d116",
   "metadata": {},
   "source": [
    "### Looking at the Deduplicated Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "e4ebf5c6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>_curator_dedup_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Once upon a time, there was a big dog named Ma...</td>\n",
       "      <td>1150000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Once upon a time, there was a shy little girl ...</td>\n",
       "      <td>1150001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Once upon a time, there was a little boy named...</td>\n",
       "      <td>1150002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Once upon a time, there was a little bear who ...</td>\n",
       "      <td>1150005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Once upon a time, there was a little girl name...</td>\n",
       "      <td>1150006</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  _curator_dedup_id\n",
       "0  Once upon a time, there was a big dog named Ma...            1150000\n",
       "1  Once upon a time, there was a shy little girl ...            1150001\n",
       "2  Once upon a time, there was a little boy named...            1150002\n",
       "5  Once upon a time, there was a little bear who ...            1150005\n",
       "6  Once upon a time, there was a little girl name...            1150006"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deduplicated_path = os.path.join(output_path, \"deduplicated\")\n",
    "\n",
    "pd.read_parquet(os.path.join(deduplicated_path, os.listdir(deduplicated_path)[0])).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bd1c200",
   "metadata": {},
   "source": [
    "## Printing Statistics of the Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "d175ae31",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of input rows\t: 2,119,719\n",
      "Number of output rows\t: 1,684,623\n",
      "Number of removed rows\t: 435,096\n",
      "Ratio of removed rows\t: 20.53%\n"
     ]
    }
   ],
   "source": [
    "number_of_input_rows = sum(task._stage_perf[1].num_items_processed for task in removal_out)\n",
    "number_of_output_rows = sum([task._stage_perf[2].num_items_processed for task in removal_out])\n",
    "number_of_removed_rows = sum([task._metadata.get(\"num_removed\") for task in removal_out])\n",
    "\n",
    "print(f\"Number of input rows\\t: {number_of_input_rows:,}\")\n",
    "print(f\"Number of output rows\\t: {number_of_output_rows:,}\")\n",
    "print(f\"Number of removed rows\\t: {number_of_removed_rows:,}\")\n",
    "print(f\"Ratio of removed rows\\t: {(number_of_removed_rows * 100 / number_of_input_rows):.2f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e016262f",
   "metadata": {},
   "source": [
    "## Stop the Cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "753c72a9",
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "client.stop()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7bf3c46e",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "We broke down the semantic deduplication process into distinct steps - embedding generation, K-means clustering with pairwise similarity computation, duplicate identification, and final removal.\n",
    "\n",
    "We showed how to create, persist, and reuse the ID Generator across different workflow stages, enabling consistent ID assignment throughout the process.\n",
    "\n",
    "By analyzing the cosine similarity distribution across our dataset, we determined an appropriate `eps` threshold of 0.1, which resulted in removing ~20% of our data.\n",
    "\n",
    "This step-by-step approach provides users with fine-grained control over each stage of the deduplication process, making it suitable for production environments where different components may need to be optimized or scaled independently."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
